EMMA: a new method for computing multiple sequence alignments given a constraint subset alignment

Background Adding sequences into an existing (possibly user-provided) alignment has multiple applications, including updating a large alignment with new data, adding sequences into a constraint alignment constructed using biological knowledge, or computing alignments in the presence of sequence length heterogeneity. Although this is a natural problem, only a few tools have been developed to use this information with high fidelity. Results We present EMMA (Extending Multiple alignments using MAFFT--add) for the problem of adding a set of unaligned sequences into a multiple sequence alignment (i.e., a constraint alignment). EMMA builds on MAFFT--add, which is also designed to add sequences into a given constraint alignment. EMMA improves on MAFFT--add methods by using a divide-and-conquer framework to scale its most accurate version, MAFFT-linsi--add, to constraint alignments with many sequences. We show that EMMA has an accuracy advantage over other techniques for adding sequences into alignments under many realistic conditions and can scale to large datasets with high accuracy (hundreds of thousands of sequences). EMMA is available at https://github.com/c5shen/EMMA. Conclusions EMMA is a new tool that provides high accuracy and scalability for adding sequences into an existing alignment. Supplementary Information The online version contains supplementary material available at 10.1186/s13015-023-00247-x.


S1.1 UPP-add
UPP [Nguyen et al., 2015] is a method that was designed to align sequence datasets that have high sequence length heterogeneity.UPP uses two stages, where the first stage extracts and aligns a subset of the sequences deemed to be full-length, and the second stage adds in the remaining sequences into the alignment from the first stage (which we refer to as the "backbone alignment").In this paper, we focus on the second stage of UPP, which we will refer to as UPP-add.
Given a backbone alignment, UPP-add builds an ensemble of Hidden Markov Models (i.e., an eHMM) to represent the backbone alignment, which it then uses to add query sequences into the backbone alignment.To build the eHMM, UPPadd computes a backbone tree on the backbone alignment using the maximum likelihood heuristic FastTree2 [Price et al., 2010].Then, UPP-add recursively decomposes the backbone tree into smaller subtrees by deleting "centroid edges" until the last decomposition results in subtrees with at most A leaves (A = 10 in default UPP).Each subtree thus defines a subset of the backbone sequences and hence also a subalignment (induced by the backbone alignment on the subset of sequences).For each subalignment, an HMM is created using hmmbuild from the HMMER suite [Finn et al., 2011].Each query sequence is searched against all HMMs in the eHMM using hmmsearch (also from HMMER), and then mapped to the HMM with the highest bit-score.The alignment of the query sequence to the subalignment for the selected HMM is computed using hmmalign (also from HMMER).Because the subalignments are induced by the backbone alignment, this also defines an alignment of the query sequence to the backbone alignment, which we refer to as an "extended alignment".The set of all such extended alignments (one for each query sequence) are merged using transitivity to form the final alignment.

S1.2 WITCH-add and WITCH-ng-add
WITCH [Shen et al., 2022] and WITCH-ng [Liu and Warnow, 2023] are two MSA methods that use the same type of two-stage approach as UPP, where the first stage extracts and aligns a subset of the sequences, and the second stage adds the remaining sequences into the backbone alignment.Here we focus on the second stages of these methods, i.e., WITCH-add and WITCH-ng-add.
Like UPP-add, both WITCH-add and WITCH-ng-add compute the backbone tree and eHMM, and then perform an all-against-all search between all query sequences and all HMMs.Then, for a given query sequence q, they compute a weight between for each query-HMM pair based on an "adjusted bitscore", as described in Shen et al. [2022], and select the top k HMMs for that query sequence (k = 10 in default WITCH).As described in UPP-add, each such query-HMM pair defines a specific extended alignment for the query sequence, i.e., a specific way of adding the query sequence to the backbone alignment.WITCH-add and WITCH-ng-add each compute a consensus of these k extended alignments, and differ in how these consensus alignments are computed.WITCH-add computes a consensus of these k extended alignments using the Graph Clustering Merger (GCM), a technique from MAGUS [Smirnov and Warnow, 2021], while WITCH-ng-add uses a simple variant of Smith-Waterman [Smith and Waterman, 1981] to compute the consensus alignment.These consensus alignments thus define a single extended alignment for the query sequence to the backbone alignment.Finally, as in UPP-add, these extended alignments are merged together using transitivity.WITCH-add in general, has better alignment accuracy than UPP-add [Shen et al., 2022], but WITCH-ng-add is faster than WITCH and at least as accurate [Liu and Warnow, 2023].Thus, in this study, we examined WITCH-ng-add but not WITCH-add.

S1.3 Transitivity Merger
Transitivity is a mathematical property for some binary relations.It says that if X and Y are related, and if Y and Z are related, then X and Z must also be related.Since homology (i.e., descent from a common ancestor) is such a relationship, and since the objective of multiple sequence alignment is to detect homology, we will be able to use transitivity to compatible alignments that share some letters through transitivity.
The Transitivity Merger is used in any two-phase method that constructs an alignment on a subset and then adds each of the remaining sequences to the subset alignment.The collection of extended alignments (one such alignment for each remaining sequence) must be merged together, and the merging process is produced using transitivity.Thus, it is a fundamental step in UPP Nguyen et al. [2015].However, it is also used in PASTA Mirarab et al. [2015], which produces disjoint alignments, then merges pairs of these alignments, producing overlapping subsets that are compatible with each other.To merge this set of overlapping alignments, it uses transitivity.Thus, it is also a fundamental step in PASTA.
In Figure S2 we provide an example of how transitivity is used to merge two alignments, in the context of a two-phase method where the backbone alignment is divided into two subsets.The backbone alignment has four sequences and is split into two subalignments, each with two sequences.Then, two query sequences AACTA and AATCAA are added to these two subalignments, producing two extended subalignments.Since the two subalignments were taken from the backbone alignment, each site in each extended subalignment corresponds to either a specific site in the backbone alignment or corresponds to an insertion.In general, insertions can be between sites, but in the example shown, each insertion is at the end of the subalignment.We show how this information is used to merge the two extended subalignments.
The first A in query sequence AACTA is aligned to the first column of the first subalignment, which implies it will be placed in the first column of the final alignment, which will also contain the nucleotides from the first column of the backbone alignment.This is an application of transitivity.
The same argument shows that the first A in query sequence AATCAA is aligned to the first column of the second subalignment, which implies it also will be placed in the first column of the backbone alignment.Thus, the first As in these query sequences will be considered homologous to each other, as well as to all the nucleotides in the first column of the backbone alignment, and placed in the first column of the final alignment.This is also an application of transitivity.
In contrast, the last A in each of the two query sequences is placed in the last column by itself within their extended subalignments.This means each of these As are not considered homologous to any letter in their extended subalignments, and hence, they cannot be in a column with any other letter in the final alignment.Hence, the final alignment must have two final columns, each containing one of these two final As and no other letters (otherwise, some homology would be implied).However, the order of these two final columns is not determined, so the transitivity merge allows both outcomes.

S1.4 Adjusted Bitscore
The adjusted bitscore is given by the following equation: where H i represents a hidden Markov model (HMM), q is a query sequence to be added, d is the total number of HMMs, BS(H i , q) is the bitscore between H i and q (computed by HMMER [Finn et al., 2011]), and s i is the number of sequences in the alignment that generates H i .For additional details, see Shen et al. [2022], Section 2.4.

S2 Commands for software
All commands are based on the assumption that 16 CPU cores are available.

S2.1 Backbone generation and sequence-adding methods
1.For Experiment 0 only, we computed a MAGUS backbone alignment (GitHub version committed on April 5th 2021): 2. We used FastTree2 (v2.1 multi-threaded version) to generate a backbone tree from a given constraint (backbone) alignment for each dataset (-ntgtr for nucleotide or -lg for amino acids): 3. We used RAxML-ng (v1.0.3) to generate the reference tree for Rec and Res datasets (for selecting a clade of sequences in Experiment 2 -adding to a clade-based backbone).We ran bootstrapping on their reference alignments and used the first tree returned:

S2.2 Evaluation
1. FastSP (v1.7.1) to obtain SPFN and SPFP.We average these two results to produce an overall "Alignment error".The "-ml -mlr" options will ignore columns with lowercase letters (considered as insertions in UPP and WITCH): 2. Runtime is obtained by adding the following command before any software:  is specified with a parenthesis after the dataset name.The p-distance is defined as follows.The Hamming distance between two (aligned) sequences is defined as the number of columns that have different letters and neither is gapped.The normalized Hamming distance is the Hamming distance divided by the total number of columns where both sequences have letters, thus producing values between 0.0 and 1.0.This value is known as the p-distance.
The percent gaps is the percentage of the alignment that is occupied by "-".The Indelible, ROSE, and RNASim datasets are all simulated, and the remaining sequence datasets are biological.Except for Rec and Res, there are either true (known because simulated) or curated alignments on all the sequences in the dataset.The Rec and Res datasets have curated alignments only on a subset of the sequences (66 for Rec and 112 for Res).See the main paper for additional details.

S3.2 Dataset Generation: INDELible
Table S2: Indel rate and tree scale parameters for generating the 5000Mhet series data, where l is the tree scale parameter, defining the maximum path-length in the non-ultrametric model tree, and r defines the indel rate.
Condition l (tree scale) r (indel rate) We present details for generating our new simulated conditions using IN-DELible [Fletcher and Yang, 2009] below.Our simulation parameters are based on those of the 1000M series of the ROSE simulated dataset [Liu et al., 2009].We uploaded all of our INDELible control files generating this data to https://github.com/ThisBioLife/5000M-234-het to allow easy reproduction.

S3.2.1 Model trees
Our model trees were generated by a two-step process.We first generated random 5000-species birth-death trees (one tree per replicate) using the DendroPy [Sukumaran and Holder, 2010] treesim.birthdeath tree function, setting the birth rate initially at 1 with the standard deviation of the change to the birth rate set to 0.2 (see the documentation on this Dendropy function on the birthdeath process and the implication of this parameter), with a zero death rate.Then, taking only the topology of the tree generated in the prior step, we instructed INDELible to assign branch lengths to the tree such that the resulting tree is both non-ultrametric and has a maximum path-length l.We vary l to control the scale of the tree and hence the rate of evolution of the condition.The choice of l across conditions can be seen in Table S2.

S3.2.2 Indel length distribution & indel rates
Our model for sequence length heterogeneity assumes a base ("short indel") distribution of indel lengths.In our case, we simply took the "medium" length distribution from the "M"-series ROSE simulated datasets [Liu et al., 2009] as the short indel distribution.Given an indel event, with probability p = 0.85, it draws its length from the short indel distribution.Otherwise, it draws its length from a long indel distribution, in our case set to NB(130, 0.5).The indel length distribution is thus equivalently a mixture distribution of the short indel distribution (prior probability 0.85) and the long indel distribution (prior probability 0.15).We directly computed the probability mass function (PMF) of this mixture distribution, truncated the PMF, and fed the truncated PMF to INDELible as part of the input.The truncated PMF can be found alongside the uploaded control files in the Github repository linked.
Similar to the original ROSE dataset, we also varied the indel rates across conditions, with the choices for the indel rate r shown in Table S2.The indel rates were chosen analogous to the original indel rates of the ROSE dataset.

S3.2.3 GTR parameters
Table S3: GTR matrix parameters for generating the 5000M-het series dataset.The parameters are set according to the parameters for the ROSE simulated data [Liu et al., 2009].
The rest of the parameters (e.g., GTR+Γ parameters and the initial sequence length) were chosen to be the same as the ROSE simulated dataset generated for the SATé study [Liu et al., 2009] and can be found in the uploaded control files (https://github.com/ThisBioLife/5000M-234-het).We also provide the parameters here: • GTR parameters: see Table S3 • Stationary frequencies (TCAG): 0.311475, 0.191363, 0.300414, 0.196748 • α (for the gamma distribution): 1 • Initial sequence length: 1000

S3.3.1 Background
Software is available that maps mobile DNAs within bacterial and archaeal genomes [Hudson et al., 2015, Mageeney et al., 2020], where each mapping is associated with the sequence of the integrase enzyme that catalyzes the sitespecific integration of the mobile DNA.

S3.3.3 Evaluation
We compare the estimated alignments to the seed Pfam alignments for both Rec and Res to obtain SPFN, SPFP, and the expansion score (see main text for definitions).

S4 Computational issues for MAFFT-linsi-add
and MAFFT-add S4.1 Experiment 0: MAFFT-linsi-add scalability issue on 5000-taxon datasets In Experiment 0, we limited the runtime to 12 hours, 64 GB of memory, and 16 cores.We varied the number of queries to be added in this experiment.However, when we tried running MAFFT-linsi --add (MAFFT-linsi-add) on our training 5000M2 datasets with the full set of 4000 query sequences to a 1000-taxon fulllength backbone alignment, we encountered either out-of-memory issues (64 GB memory limit) or crashes.
The out-of-memory error message looks like the following: slurmstepd : error : Detected 1 oom -kill event ( s ) in StepId =5376434.batch cgroup .Some of your processes may have been killed by the cgroup out -of -memory handler .
The crash error message looks like the following: Command exited with non -zero status 1

S4.2 Experiment 2: MAFFT-add out-of-memory issues
In Experiment 2, we allowed 24 hours and 128 GB memory, as well as 16 cores.
batch .Some of your processes may have been killed by the cgroup out -of -memory handler .

Figure S1 :
Figure S1: Experiment 0: SPFN (top left), SPFP (top right), alignment error (average of SPFN and SPFP, bottom left), and runtime in hours (bottom right) of MAFFT-add and MAFFT-linsi-add for adding 100, 200, 1000, or 2000 sequences to a 1000-taxon backbone alignment.The dataset used is 5000M2-het with 10 replicates, where 1000 full-length sequences are randomly selected and aligned with MAGUS [Smirnov and Warnow, 2021] to form the backbone alignment.Averages over ten replicates are shown.Error bars shown for alignment errors are standard errors and standard deviations for runtime.We exclude replicate 4 because MAFFT-linsi-add encountered out-of-memory issues when adding 100 or 200 query sequences.Additionally, MAFFT-linsi-add either encountered out-of-memory issues or did not complete within 12 hours when adding 1000 or 2000 query sequences and thus is not shown.

Figure
Figure S2: A simple example of how transitivity merging works when merging two extended subalignments, each adding one query new sequence.Insertion letters are marked in bold and italicized.The two insertion letters, marked in the final columns of the two extended subalignments, are put into separate columns in the final alignment.See text for additional discussion.

Figure S3 :
Figure S3: Experiment 1: We show alignment error SPFN (top left) and SPFP (top right) as well as runtime (bottom) for EMMA run with different settings of l and u (the algorithmic parameters governing subset size).Parameters l and u are drawn from {10, 25, 50, 100}, and with l < u.This experiment was performed with a backbone of 1000 randomly selected sequences of 10 replicates of the 5000M2-het model condition.Alignment error is calculated on query sequences only, and white squares mark the averages.

Figure S7 :
Figure S7: Experiment 2: SPFN when adding to large random backbone alignments.The top panel denotes biological datasets, and the bottom panel denotes simulated datasets.MAFFT-linsi-add failed to finish within 24 hours for datasets except for 10AA and ROSE 1000M1-4 and was not run on Rec and Res due to their large numbers of sequences.MAFFT-add encountered out-of-memory issues on the Res dataset (Section S4).Failed runs, or runs not attempted, are marked with "X".

Figure S8 :
Figure S8: Experiment 2: SPFN when adding to small random backbone alignments.The top panel denotes biological datasets, and the bottom panel denotes simulated datasets.MAFFT-linsi-add failed to finish within 24 hours for datasets except for 10AA and ROSE 1000M1-4 and was not run on Rec and Res due to their large numbers of sequences.MAFFT-add encountered out-of-memory issues on the Res dataset (Section S4).Failed runs, or runs not attempted, are marked with "X".

Figure S9 :
Figure S9: Experiment 2: SPFN when adding to clade-based backbone alignments.The top panel denotes biological datasets, and the bottom panel denotes simulated datasets.MAFFT-linsi-add failed to finish within 24 hours for datasets except for 10AA and ROSE 1000M1-4 and was not run on Rec and Res due to their large numbers of sequences.MAFFT-add encountered out-of-memory issues on the Res dataset (Section S4).Failed runs, or runs not attempted, are marked with "X".

Figure S10 :
Figure S10: Experiment 2: Runtime (log-scale) in minutes when adding to small random backbone alignments.The top panel denotes biological datasets, and the bottom panel denotes simulated datasets.MAFFT-linsiadd failed to finish within 24 hours for datasets except for 10AA and ROSE 1000M1-4 and was not run on Rec and Res due to their large numbers of sequences.MAFFT-add encountered out-of-memory issues on the Res dataset (Section S4).Failed runs, or runs not attempted, are marked with "X".Error bars indicate standard deviation.

Figure S11 :
Figure S11: Experiment 2: Runtime (log-scale) in minutes when adding to clade-based backbone alignments.The top panel denotes biological datasets, and the bottom panel denotes simulated datasets.MAFFT-linsiadd failed to finish within 24 hours for datasets except for 10AA and ROSE 1000M1-4 and was not run on Rec and Res due to their large numbers of sequences.MAFFT-add encountered out-of-memory issues on the Res dataset (Section S4).Failed runs, or runs not attempted, are marked with "X".Error bars indicate standard deviation.

Figure S13 :
Figure S13: Experiment 2: Individual SPFN for 10AA datasets when adding to small random backbone alignments.

Table S1 :
Empirical statistics of study datasets."*" marks the training dataset.Numbers in parentheses after dataset names indicate the number of replicates.

Table S4 :
Average expansion scores of EMMA, WITCH-ng-add, MAFFT --add, and MAFFT-linsi --add for different model conditions on all datasets."X" denotes that the method failed due to runtime (24-hour limit) or outof-memory issues (128 GB).All values are rounded to the nearest tenth.

Table S5 :
Memory usage (mean, minimum, and maximum in GB) of EMMA, WITCH-ng-add, and MAFFT--add for different model conditions on all datasets.For min/max memory usage, we also show the specific dataset in parentheses.The maximum memory usage of MAFFT--add is the memory limit (128 GB) for Res, and led to out-of-memory crashes.MAFFT-linsi--add only ran on the datasets with at most 1000 sequences, and so its memory usage is not comparable to the other methods.All values are rounded to the nearest hundreds.